Visualizing Data
The presentation of data in a pictorial or graphical format.
One of the most important, but also most easily misused, elements of data analytics.
Data Visualization Tips
There are a few basic concepts that can help you generate the best visuals for displaying your data:
ggplot2
library(ggplot2)
# library(tidyverse)  # alternative: also loads dplyr (%>%, group_by(), filter() are used later)
Most robust and versatile
Based on the “Grammar of Graphics”
Plots are built up in layers
Plot Ingredients
Data
Mapping: maps variables to plot elements
Geometries (geoms): points, lines, boxes, histograms, bars, etc.
Scales: control the mapping of values in data space to values in aesthetic space
Guides: axes and legends, which show how visual properties map back to the data space
Labels: axis, legend, titles
Themes: visual themes for the plot.
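As a rough illustration of how these ingredients fit together in one call, the sketch below uses the built-in faithful data set. The colors, labels, and the cut-off of 3 minutes are arbitrary choices made for this example only.

```r
library(ggplot2)

# Data + mapping: eruption length vs. waiting time, colored by a logical flag
ingredients_plot <- ggplot(faithful,
                           aes(x = waiting, y = eruptions, color = eruptions > 3)) +
  # Geometry: one point per observation
  geom_point() +
  # Scale: choose which color each value of the flag receives
  scale_color_manual(values = c("FALSE" = "steelblue", "TRUE" = "firebrick")) +
  # Labels: axis, legend, and title text (the legend itself is the guide)
  labs(x = "Waiting time (min)", y = "Eruption length (min)",
       color = "Longer than 3 min", title = "Old Faithful eruptions") +
  # Theme: overall visual style
  theme_bw()

ingredients_plot
```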
The Big 3
Only 3 ingredients are required to make a plot.
Data
Mapping / Aesthetics
A “geom”
?ggplot
ggplot(data = NULL, mapping = aes(), ..., environment = parent.frame())
1. Data
Always begin with the main function in ggplot2: ggplot()
Data are specified via the “data” argument.
Calling ggplot() creates an empty plot object, with a coordinate system, to which layers are added.
2. Aesthetics
aes() maps variables from a data set to various elements of a plot
Discrete values (groups / categories) can have color, shape, linetype, or fill mappings.
Points additionally take x and y position mappings.
Mappings go into the aes() function, which is the 2nd argument of ggplot().
ggplot(data = df, aes(x = V1, y = V2, color = V3))
Any part of the plot related to the data goes in aes()
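A small sketch of the difference between mapping an aesthetic to the data (inside aes()) and setting it to a fixed value (outside aes()). The built-in mtcars data set is used here only as a stand-in; it does not appear elsewhere in these slides.

```r
library(ggplot2)

# Mapped: color depends on a column, so it goes inside aes() and gets a legend
mapped <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point()

# Set: one fixed color for every point, so it goes outside aes()
fixed <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "blue")

mapped
fixed
```

A common slip is writing aes(color = "blue"), which treats "blue" as a data value and produces a legend entry rather than blue points.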
3. Geoms
Geoms are the geometric objects drawn in your plot.
Common geoms include:
geom_boxplot()
geom_histogram()
geom_line()
geom_density()
geom_bar()
geom_point()
etc.
IMPORTANT
ggplot() is built in layers
Use the + operator to add layers to the existing ggplot() object.
In this way, your code is explicit about which layers are added and in what order.
ggplot(data = mydata, aes(x = V1, y = V2, color = V3)) + geom_point()
Have Data?
Variation in Design
To build your plots layer by layer, chain geoms together with +:
ggplot(mydata, aes(x, y)) + geom_point() + geom_line()
PSA: the data can also be piped into ggplot(); layers are still added with +, not %>%:
mydata %>% ggplot(aes(x, y)) + geom_point() + geom_line()
Adding layer by layer:
my_plot <- ggplot(df, aes(x, y))
my_plot <- my_plot + geom_point()
my_plot <- my_plot + geom_line()
Printing Plots
You do not need to create an object for the plot:
ggplot(data = df, aes(x = V1, y = V2, color = V3)) + geom_point()
BUT you can assign your plot to a variable…
my_plot <- ggplot(df, aes(x, y)) + geom_point()
…and then print / view your plot
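A minimal sketch of printing a stored plot, using the built-in faithful data in place of the placeholder df. The loop is an added illustration of the one case where an explicit print() is needed.

```r
library(ggplot2)

my_plot <- ggplot(faithful, aes(x = waiting, y = eruptions)) + geom_point()

# At the console, typing the object's name prints it
my_plot

# Inside a loop or function nothing is auto-printed, so call print() explicitly
for (v in c("waiting", "eruptions")) {
  p <- ggplot(faithful, aes(x = .data[[v]])) +
    geom_histogram(bins = 30) +
    labs(x = v)
  print(p)
}
```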
Building Common Visualizations
Boxplots
Visualize the distribution of a continuous variable by plotting its five-number summary:
Minimum
25th percentile
Median (50th percentile)
75th percentile
Maximum
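The five numbers can be computed directly in base R, sketched here on the built-in faithful data. Note that geom_boxplot(), like base R's boxplot(), draws whiskers only as far as the most extreme points within 1.5 * IQR of the box and plots anything beyond as individual outliers, so the whisker ends equal the minimum and maximum only when there are no outliers.

```r
# Five-number summary of the waiting times
five <- fivenum(faithful$waiting)
names(five) <- c("min", "lower hinge", "median", "upper hinge", "max")
five

# boxplot.stats() returns the values a base R boxplot actually draws
bp <- boxplot.stats(faithful$waiting)
bp$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bp$out    # points that would be drawn individually as outliers
```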
Boxplots
One continuous variable and one discrete variable
# Assumes the howells data frame is already loaded
gol <- howells[howells$Population %in% c('ARIKARA', 'HAINAN', 'NORSE'), ]
ggplot(gol, aes(x = Population, y = GOL)) + geom_boxplot()
Boxplots: Discrete Colors
Discrete variables can also be used to differentiate plot elements by including them in the aes() function
ggplot(gol, aes(x = Population, y = GOL, color = Sex)) + geom_boxplot()
Boxplots: Discrete Fills
Mapping the same variable to fill colors the inside of each box rather than its outline
ggplot(gol, aes(x = Population, y = GOL, fill = Sex)) + geom_boxplot()
Histograms, Density Plots, and Bar Plots
One vector / column of your data
Histograms
Divides the range of values into “bins” on the x-axis and displays the count in each bin on the y-axis (30 bins by default)
ggplot(faithful, aes(x = waiting)) + geom_histogram()
Histograms: binwidth, fill, and color
ggplot(faithful, aes(x = waiting)) + geom_histogram(binwidth = 5, fill = "white", color = "black")
Histograms: Faceting
facet_grid() separates (“facets”) the plots by rows, columns, or both.
General format: ROWS ~ COLUMNS
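Note: a . on either side of the ~ means “no faceting on this dimension.”
library(MASS)
data(birthwt)
# mytheme is the custom theme defined under "Code" at the end of these slides
ggplot(birthwt, aes(x = bwt)) + geom_histogram(fill = "white", color = "black") + facet_grid(smoke ~ .) + mytheme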
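To make the ROWS ~ COLUMNS format concrete, here is a sketch of the three layouts on the same birthwt histogram. Faceting by race and the explicit bins = 30 are additions for this example.

```r
library(ggplot2)
data(birthwt, package = "MASS")

base <- ggplot(birthwt, aes(x = bwt)) +
  geom_histogram(bins = 30, fill = "white", color = "black")

by_rows <- base + facet_grid(smoke ~ .)     # one row per value of smoke
by_cols <- base + facet_grid(. ~ smoke)     # one column per value of smoke
by_both <- base + facet_grid(smoke ~ race)  # rows by smoke, columns by race

by_rows
by_cols
by_both
```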
Density Plots
“Nonparametric method for estimating the probability density function of a random variable”
The y-axis shows density rather than count: the area under the curve totals 1
Results in a smoothed curve
ggplot(faithful, aes(x = waiting)) + geom_density()
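As a quick check on what “density” means, the sketch below uses base R's density() function (a kernel density estimator, used here on its own rather than through ggplot2) and approximates the area under the estimated curve.

```r
# Kernel density estimate of the waiting times
d <- density(faithful$waiting)

# Approximate the area under the curve with the trapezoid rule
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area  # close to 1
```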
Density Plots: Sensitivity
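adjust multiplies the smoothing bandwidth: values below 1 give a wigglier curve, values above 1 a smoother one
ggplot(faithful, aes(x = waiting)) + geom_density(adjust = 0.25)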
Density Plots: Fills
geom_density() can be filled with a color.
NOTE: alpha sets the transparency of the fill
0 = fully transparent
1 = fully opaque
ggplot(faithful, aes(x = waiting)) + geom_density(fill = "blue", alpha = 0.5)
Density Plots: Factors
Just like other plots we have seen, a discrete variable (factor) can be used to separate groups
ggplot(birthwt, aes(x = bwt, fill = factor(smoke))) + geom_density(color = NA, alpha = 0.5)
Adding Layers
# Note the use of after_stat(density): it rescales the histogram from counts to density,
# so the histogram and the density curve share the same y-axis.
ggplot(faithful, aes(x = waiting, y = after_stat(density))) +
  geom_histogram(fill = "white", color = "black") + geom_density(fill = "steelblue", alpha = 0.4)
Bar Plots
Displays the count of each level of a discrete / ordinal variable.
ggplot(diamonds, aes(x = cut)) + geom_bar()
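geom_bar() does the counting itself. When the heights are already computed, geom_col() plots them as given. A brief sketch of the two routes to the same chart; the cut_counts table is built only for this example.

```r
library(ggplot2)

# geom_bar() counts the rows in each category for you
p_bar <- ggplot(diamonds, aes(x = cut)) + geom_bar()

# geom_col() plots values you have already computed
cut_counts <- as.data.frame(table(cut = diamonds$cut))
p_col <- ggplot(cut_counts, aes(x = cut, y = Freq)) + geom_col()

p_bar
p_col
```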
Bar Plots: Bar Width
ggplot(diamonds, aes(x = cut)) + geom_bar()             # default width is 0.9
ggplot(diamonds, aes(x = cut)) + geom_bar(width = 0.1)  # thin bars
ggplot(diamonds, aes(x = cut)) + geom_bar(width = 1)    # bars touch
Bar Plots: Factors
ggplot(diamonds, aes(x = cut, fill = color)) + geom_bar()                    # stacked (default)
ggplot(diamonds, aes(x = cut, fill = color)) + geom_bar(position = "dodge")  # side by side
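Beyond stacking and dodging, position = "fill" rescales each bar to a height of 1, which is handy for comparing proportions across groups. A short sketch comparing the three options on the same data:

```r
library(ggplot2)

base <- ggplot(diamonds, aes(x = cut, fill = color))

p_stack <- base + geom_bar(position = "stack")  # counts stacked (the default)
p_dodge <- base + geom_bar(position = "dodge")  # counts side by side
p_fill  <- base + geom_bar(position = "fill")   # proportions within each cut

p_stack
p_dodge
p_fill
```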
Line Plots
Visualizes how the variable on the y-axis changes in relation to the variable on the x-axis
The x-axis can represent discrete (categorical) or continuous (numeric) variables
geom_line() example
ggplot(BOD, aes(x = Time, y = demand)) + geom_col()
ggplot(BOD, aes(x = Time, y = demand)) + geom_line()
ggplot(BOD, aes(x = Time, y = demand)) + geom_line() + geom_point()
Change Point Characters
The pch argument (an alias for shape) changes the point type; lty (an alias for linetype) changes the line type.
ggplot(BOD, aes(x = Time, y = demand)) + geom_line(lty = 2) + geom_point(pch = 7)
Group by Linetype
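# Plot the mean length per dose and supplement; the raw data have several observations per dose
ToothGrowth %>% group_by(dose, supp) %>% summarize(len = mean(len)) %>% ggplot(aes(x = dose, y = len, lty = supp)) + geom_line()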
Size
Use size to adjust point size and lwd to adjust line width.
tg <- ToothGrowth %>% group_by(dose, supp) %>% summarize(len = mean(len))
ggplot(tg, aes(x = dose, y = len, shape = supp, lty = supp)) + geom_line(lwd = 2) + geom_point(size = 6)
Scatterplots
Bi-variate scatter plots help you visualize relationships between two quantitative / continuous variables
When there are additional variables being explored you can use a scatterplot matrix
Helps identify outliers
Helps identify multicollinearity
Includes stat_functions (linear and loess lines)
Can also incorporate boxplots / histograms / rug plots
library(MASS)
bw <- birthwt %>% dplyr::select(age, lwt, smoke, bwt)
# create labelled factors
unique(bw$smoke)
bw$smoke <- factor(bw$smoke, levels = c(0, 1), labels = c("No", "Yes"))
# plot
ggplot(bw, aes(x = lwt, y = bwt)) + geom_point()
Scatterplots: Discrete Color
ggplot(bw, aes(x = lwt, y = bwt, color = smoke)) + geom_point()
Scatterplots: Discrete Shapes
ggplot(bw, aes(x = lwt, y = bwt, shape = smoke)) + geom_point(size = 4)
Scatterplots: Specifying Shape
scale_shape_manual() sets which point shape (pch value) is used for each group
ggplot(bw, aes(x = lwt, y = bwt, shape = smoke)) + geom_point(size = 4) + scale_shape_manual(values = c(1, 16))
Scatterplots: Adding a Stat Line
stat_smooth() adds a fitted line from the statistical method given by the “method” argument (here lm, a linear model)
ggplot(bw, aes(x = lwt, y = bwt)) + geom_point(size = 4) +
  stat_smooth(aes(color = smoke), method = "lm", lwd = 4)
ggplot(bw, aes(x = lwt, y = bwt)) + geom_point(size = 4) + stat_smooth(aes(lty = smoke), method = "lm", lwd = 4, color = "red")
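With method = "lm" and the default formula y ~ x, the drawn line is an ordinary least-squares fit within each group. The sketch below fits those models directly with lm() so the intercepts and slopes can be inspected. Splitting by the raw 0/1 smoke codes is a simplification for this example.

```r
data(birthwt, package = "MASS")

# One line for all mothers: birth weight as a linear function of mother's weight
fit_all <- lm(bwt ~ lwt, data = birthwt)
coef(fit_all)

# One line per smoking group, as drawn when smoke is mapped to color or lty
fit_by_smoke <- lapply(split(birthwt, birthwt$smoke),
                       function(d) coef(lm(bwt ~ lwt, data = d)))
fit_by_smoke
```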
geom_smooth()
geom_smooth() adds a smoothing line to our plot; with the default method and fewer than 1,000 observations this is a loess curve
Loess = Locally Estimated Scatterplot Smoothing
ggplot(bw, aes(x = lwt, y = bwt)) + geom_point(size = 4) +
  geom_smooth(se = FALSE, lwd = 2, color = "red")
Labels + Titles
pc <- read.csv ('/Users/christopherwolfe/Library/CloudStorage/GoogleDrive-chriswolfe93091@gmail.com/My Drive/ECU_Courses/Spring2025/ANTH_Stat/data/goldman_pc.csv' )
library(magrittr)
pc %<>% filter(Inst %in% c("DC", "KSU", "NM", "WOAC"))
example_plot <- ggplot(pc, aes(x = RRML, y = lhml, color = Inst)) + geom_point(size = 3) + scale_color_manual(values = c("red", "green", "goldenrod", "purple"))
example_plot <- example_plot + labs(x = "Radius max. length", y = "Humerus max. length", title = "Scattered", color = "Institution")
example_plot
Axes
example_plot + xlim(200, 300) + ylim(250, 350)
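One thing to keep in mind: xlim() and ylim() discard observations outside the limits (with a warning) before any statistics are computed, whereas coord_cartesian() only zooms the view. The sketch below shows the difference on the built-in mtcars data, a stand-in chosen for this example because the goldman_pc.csv file is not bundled with R.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# xlim() drops cars outside 2-4 BEFORE the line is fitted
clipped <- base + xlim(2, 4)

# coord_cartesian() zooms in, but the line is still fitted to every car
zoomed <- base + coord_cartesian(xlim = c(2, 4))

clipped
zoomed
```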
Themes
Themes are used to adjust the visual appearance of elements in your plot.
element_blank()
element_line()
element_rect()
element_text()
element_blank() is used to remove a specified element from the plot
element_text() is the most used
example_plot + theme_bw() + theme(axis.title.x = element_text(colour = "red", size = 12))
element_text()
element_text() is used to change fonts, sizes, justification, angles, and more
help(element_text) for more options
TIP: Load ggplot2 and type help(theme) for all theme elements
theme(title = element_text(size = 16))
theme(plot.title = element_text(size = 22))
theme(axis.title = element_text(size = 16))
theme(axis.text = element_text(size = 12))
theme(legend.text = element_text(size = 12))
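These settings can also be combined in a single theme() call and added to a plot like any other layer. A sketch on the built-in mtcars data, with a made-up title and legend label:

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(title = "Fuel economy by weight", color = "Cylinders")

# Start from a complete theme, then override individual text elements
styled <- p + theme_bw() +
  theme(
    plot.title  = element_text(size = 22),
    axis.title  = element_text(size = 16),
    axis.text   = element_text(size = 12),
    legend.text = element_text(size = 12)
  )

styled
```

Order matters here: a complete theme such as theme_bw() replaces earlier theme() settings, so the theme() overrides go after it.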
Pre-Made Themes
These themes are often good starting points for creating a custom theme
Find your style!
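A few of the complete themes that ship with ggplot2, applied to the same plot for comparison. The mtcars scatterplot is only a placeholder for this example.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

themed <- list(
  gray    = p + theme_gray(),     # the ggplot2 default
  bw      = p + theme_bw(),       # white panel with a border
  minimal = p + theme_minimal(),  # no panel border or background
  classic = p + theme_classic()   # axis lines only, no grid
)

themed$bw
themed$classic
```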
Code
mytheme <- theme_bw() + theme(
  panel.grid.major = element_blank(),
  panel.grid.minor = element_blank(),
  legend.background = element_blank(),
  legend.box.background = element_rect(color = 'black'),
  legend.key = element_blank(),
  legend.title = element_text(face = 'plain', size = 14),
  legend.text = element_text(size = 12),
  axis.title = element_text(size = 15, lineheight = .9, vjust = .3),
  axis.text = element_text(size = 12),
  axis.title.x = element_text(vjust = .2),
  axis.title.y = element_text(vjust = .3),
  plot.title = element_text(size = 18),
  strip.text = element_text(size = 12)
)